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We proposed a probabilistic algorithm to solve the Multiple Sequence Alignment problem. The 
algorithm is a Simulated Annealing (SA) that exploits the representation of the Multiple Alignment 
between D sequences as a directed polymer in D dimensions. Within this representation we can 
easily track the evolution in the configuration space of the alignment through local moves of low 
computational cost. At variance with other probabilistic algorithms proposed to solve this problem, 
our approach allows for the creation and deletion of gaps without extra computational cost. The al- 
gorithm was tested aligning proteins from the kinases family. When D — 3 the results are consistent 
with those obtained using a complete algorithm. For D > 3 where the complete algorithm fails, we 
show that our algorithm still converges to reasonable alignments. Moreover, we study the space of 
solutions obtained and show that depending on the number of sequences aligned the solutions are 
organized in difi'erent ways, suggesting a possible source of errors for progressive algorithms. 



PACS numbers: 87.10.-|-e,87.15.cc,87.14.-g 
I. INTRODUCTION 

The Multiple Sequence Alignment (MSA) problem 
constitutes one of the fundamental research areas in 
Bioinformatics. While at first sight it may seem a sim- 
ple extension of the two-string alignment problem two 
strings good, four strings better, for biologists, the mul- 
tiple alignment of proteins or DNA is crucial in deduc- 
ingtheir common properties P|. Quoting Arthur Lensk 
QHS ■ One or two homologous sequences whisper. . . a full 
multiple alignment shouts out loud. 

In general, the sequences consist of a linear array of 
symbols from an alphabet of /c-letters (A: = 4 for DNA 
and k = 20 for proteins). Given D sequences to deter- 
mine a good Multiple Sequence Alignment is a relative 
task. Usually one defines a score function that depends 
on the distances between the letters of the alphabet, and 
assumes that the better alignment is the one that mini- 
mizes this score function. 

It is a common use to define the MSA score in terms 
of the scores of the pairwise global alignments of the se- 
quences (Sum of Pairs score) 0. Given two sequences 
a = ai . . . Om and 6 = ai ... 6m let A(a, b) be a cost of the 
mutation of a into b and 7 the cost of inserting or delet- 



ing of the letter. Extending A(a, b) so that A(a, — ) = 7 
and A (—,6) = 7 and considering that a null (-) sym- 
bol isolated from others (-) pays an extra cost 5 Q we 
may define the score of a pairwise alignment Mij for 
sequences Oi and bj of size m as: 

m 

s{M,^j) = J2A{a,,h,bj.h) + nS (1) 

h=l 

where n is the number of isolated (-). Then, the score 
for the multiple alignment M is given by: 

EiM,,,) = Y,s{M,^,) (2) 

The multiple sequence alignment has at least three im- 
portant applications in Biology: classification of protein 
families and superfamilies, the identification and repre- 
sentation of conserved sequences features of both DNA 
or proteins that correlate structure and/or function and 
the deduction of the evolutionary history of the sequences 
studied 0,11. 

Unfortunately the problem is known to be NP- 
complete and no complete algorithm exist to solve real or 
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random instances. Therefore, many heuristic algorithms 
have been proposed to solve this problem. The algorithm 
of Carrillo-Lipman |^ (which is complete) , is a Dynamic 
Programming algorithm able to find the multiple align- 
ment of 3 sequences, and with some heuristic added, to 
find the alignments, in reasonable time, of up to 6 se- 
quences 0. However, its computational cost scales very 
fast with the number of sequences and is of little utility 
for more ambitious tasks. In the first 90's the problem 
was approached using ideas coming from physics, J. Kim 
and collaborators and M. Ishikawa and collaborators 
used different version of the Simulated Annealing ^ 
technique with some success, but their algorithms were 
unable to change the number of gaps in the alignment. 
This means that once they started with a given initial 
configuration (usually taken from some heuristics), any 
motion of segments in the sequences conserved the num- 
ber of gaps. To extend these programs allowing the num- 
ber of gaps to change will cause the appearance of global 
moves in the algorithm that are very expensive from the 
computational point of view. 

Probably the must successful attenipt to solve this 
problem has been the Clustal project j^], a progressive 
algorithm that first organizes the sequences according to 
their distances and then aligns the sequences in a pro- 
gressive way, starting with the most related ones. More- 
over, it uses a lot of biological information, some motifs 
of residues rarely accept gaps, sub-sequences of residues 
associated with structural sub-units are preferred to stay 
together during the alignment, etc. These features, and 
a platform easy to use and integrated with other stan- 
dard bioinformatic tools, have made Clustal the favorite 
Multiple Sequence Alignment program for biologists and 
people doing bionformatics in general [To| . However, it 
also has important drawbacks. Once the first k sequences 
are aligned, the inclusion of a new sequence would not 
change the previous alignment, the gap penalties are the 
same independently on how many sequences have been 
already aligned or their properties, and being a progres- 
sive method the global minimum obtained is strongly bi- 
ased by those sequences which are more similar 0. 

Another, recent and also successful approach uses the 
concepts of Hidden Markov Models 11]. While some of 
the previous drawback associated to Clustal disappear, 
because for example, the sequences do not need to be 
organized a priori, one most start assuming a known 
model of protein (or DNA) organization, which is usu- 
ally obtained after training the program in a subset of 
sequences. Then, one must be aware that the results 
usually depend on the training set, specially if it is not 
too large. Moreover, if we are dealing with sequences 
of unknown family, or difficult to be characterized this 
approach does not guarantee good alignments. 

Therefore we decided to propose a new Simulated An- 
nealing (SA) algorithm that avoids the main difficulty 
of the previous attempts Q. Our algorithm allows 
for the insertion and deletion of gaps in a dynamic way 
using only local moves. It makes use of the mathemat- 



ical mapping between the Multiple Sequence Alignment 
and a Directed Polymer in a Random Media (DPRM) 
pointed out some years ago by Hwa et al such 
a way, it should be also possible to extrapolate all the 
computer facilities and techniques, developed in the field 
of polymers to this biological problem. 

The rest of the paper is organized in the following way. 
In the next section we make a short review of the the- 
oretical foundations of our algorithm. Then in section 
mil we explain the implementation details to discuss the 
results in section IIVI Finally the conclusions are pre- 
sented including and outlook for future improvements of 
this program. 

II. THEORETICAL BACKGROUND 

Usually, multiple sequence alignments are studied and 
visualized writing one sequence on the top of the other, 
miming a table, (see figure ^ and all the probabilistic 
algorithms devised so far use the simplicity of this repre- 
sentation to generate the moves. 



FIG. 1: Usual representation of a multiple sequence alignment 

Instead of that, we will use the well known fact [MH^ 
that the alignment of D sequences may be represented in 
a D dimensional lattice (see figure |21 for D — 2). 

The cells of the /^-dimensional lattice are labeled by 
the D indexes . . .iu). The bonds encode the ad- 

jacency of letters: A diagonal bond in a D dimensional 
space represents the Z?-pairing (a^ ^ , , . . . , Wi^ ) . The 
insertion of gaps are represented by bonds without com- 
ponents in the sequences where the gaps were inserted. 
For example, a /^-pairing (a.^j , , — , di^, . . . , ) is rep- 
resented by a bond whose projection on the third se- 
quence is zero, and the D-pairing (a^^ , — , — , , . . . , Wi^ ) 
is represented by a bond whose projection on the second 
and third sequences are zero. 

Then, any alignment maps onto a lattice path that is 
directed along the diagonal of the Z3-dimensional hyper- 



Seq 1 FHELEKIGSGEFGSVFKCVKRLDGCIYAIK-RSKKPLA.G 

Seq 2 YSILKQI6S66SSKVFQVLNEKKQ-IYAIKYVNLEEADN 

Set? 3 SIIDEVVGQGAFATVKKAIERTTGKTFAVKIISKRKVIGN 

Seq 1 R-YFSAWA-EDDHMLI-QNEYCNGGSLADAISENYRIM 

Seq 2 DYEITD-QYIYM-V-ME-CGNIDLNSWLKKKKSIDP-WE 

Seq 3 FYEDTESYY— -MVME-FVSGGDLMDFVAAHG-AVG-E 

Seq 1 HSMSLVHMDIKPSN-I-FISRTSIPNAASEEGD-EDDWA 

Seq 2 DLKPANFLIVD-G-ML-KLI-DFGIA-N-QMQPDTTS-VV- 

Seq 3 DLKPDNILIEQDDPVLVK— IT-DFGLAKVQGNGSFMK-T- 

Seq 1 EGD-SRFLANEVLQENYTHLPKADI-FALALTVV-CAAG 

Seq 2 -NGKSKSK-I3PKSDVWSLGCI-LYYMTYGK-TPFQQIIN 

Seq 3 YEERN— E-YSSLVDMWSMGCLVYVILTGHLPFSGSTQ 

Seq 1 QVLSQEFTE-LLKVMI-HPDPERRPSAM 

Seq 2 QDVLKCCL-KRDPKQRISIPELLAHPYV 

Seq 3 — EARD-FIDS-LLQVDPNNRSTAAKALN 
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cube. This lattice path may be interpreted as a Directed 
Polymer and the Random Media in the problem is pro- 
vided by the structure of the sequences to be aligned and 
by the distance between the residues in the different se- 
quences. 
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FIG. 2; A directed path (thick line) in a bi-dimensional grid 



This mapping was already fruitfully used by Hwa and 
CO. 10 to prove that the similarities between two se- 
quences can be detected only if their amount exceeds 
a threshold value and for proposing a dynamic way to 
determine the optimal parameters for a good alignment 
of two sequences. 

Here, our main focus will be to optimize the Directed 
Polymer (lattice path) under the constraints imposed by 
the sequences and their interactions in dimensions larger 
than 2. To use a Simulated Annealing algorithm we ex- 
tend the usual representation of computer science of de- 
termining the ground state of the problem to a finite tem- 
perature description. Then, a finite-temperature align- 
ment is a probability distribution 



P(C) = Ae-/5i^(C) (3) 
/j 

over all possible conformations C of the polymer and 
where E is given by equation |21 and Z is the partition 
function of the alignment [T^. The temperature (/3~^) 
controls the relative weight of alignments with differ- 
ent scores (different conformations of the polymer) while 
A(a, 6) the length of the polymer and the frequency of 
the gaps. In physical terms, P{C) defines and ensem- 
ble at temperature I3~^ with line tension 7 and chemical 
potential A(a,5).[lJ| 



III. THE ALGORITHM 

The Simulated Annealing (SA) was introduced many 
years ago by Kirkpatrick et al 8] to find a global min- 
imum of a function in combinatorial optimization prob- 
lems. SA is a probabilistic approach, that in general 
needs a state space (the different configurations of the 
directed path) and a cost or energy function ^ to be 
minimize (eq. 

Simulated Annealing generates new alignments from a 
current alignment by applying transition rules of accep- 
tance. The criteria for acceptance are the following: 

• if AE < 0, accept the new alignment 

• if AE > accept it with probability P{AE) = 

g-/3(iS„e„(C)-_Eo,d(C)) 

The parameter /3 controls the probability to accept 
a new configuration. Initially, one starts at low values 
of P (high temperatures) and then increases it apply- 
ing an annealing schedule. If the temperature is lowered 
slowly enough, it can be proved that the system reaches 
a global minimum [l5l | . Unfortunately it will require in- 
finitely computational time and one usually selects the 
best scheduling that is possible to afford with the com- 
putational facilities at hand. Then SA is run over many 
initial conditions, and one assumes that the output of 
minimum energy is (or is close) to the global minimum. 

In the Multiple Sequence Alignment this is also the 
case, but differently to what happens in other combina- 
torial optimization problems, here the average solution, 
i.e. that obtained after averaging over all the local min- 
ima's may be interesting by itself. In fact, researchers are 
often interested not in the particular details of the align- 
ment, but in its robust properties, and comparing all the 
outputs of the SA is a way to get this information. 

From the technical point of view, once a cost function 
is defined, one needs to select the moves to be associated 
to the transition rates. Our description of the Multiple 
Sequence Alignment Problem as a Directed Polymer in a 
Random Media allows us to define three types of moves, 
insertion, deletion and motion of gaps. All these moves 
are represented in figure in a two-dimensional grid. 
The extension to I?-dimensional systems is straightfor- 
ward. 

In this way we get an algorithm that allows for the 
creation of gaps, which means a search space larger than 
the usually proved by similar methods. At the same time 
the algorithm is quadratic in the number of sequences. In 
fact, the computational cost of any move is limited by the 
square of the number of sequences to be aligned. 

In this work, wc did not follow any heuristic strategy 
of optimization. Our intention was to prove the poten- 
tiality of this strategy and we kept things as simple as 
possible. For example, if we start too far from the global 
minimum, the selection of local moves alone will make the 
algorithm to converge very slowly to it. This drawback 
may be overcome using very different initial conditions or 
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take care that differently to what is usually obtained in 
other algorithms for the Multiple Sequence Alignment 
problem, these figures reflects averages values of the mul- 
tiple sequence alignment. 



FIG. 3: Local moves of the algorithm in a two dimensional 
grid (from arrows to dashed lines), a) gap insertion, b) gap 
motion and c) gap deletion 



trying, every some time steps, global moves that change 
radically the conformation of the polymer. We did not 
take care of this. During the simulation the three moves 
were chosen with probability 1/3. The only biological in- 
formation inserted was given by the cost matrix used to 
align the protein sequences. We avoid the use of impor- 
tant and well know biological information, fixed residues, 
phylogenetic tree of the sequences, etc, and the program 
was designed without the use programming optimization 
tricks. 



IV. RESULTS 

All the results presented in this section reflects the 
alignments of proteins from the kinase family, but qual- 
itatively similar results were obtained for the GPCRs 
(G protein-coupled receptors) and CRP (cAMP receptor 
protein) families. The simulations started with /3 — 1.0 
and every trei montecarlo steps P was increased by a mul- 
tiplicative factor of 1.01 until /3 — 3.0. We took care and 
in all cases the system reached the equilibrium. The dif- 
ferent initial conditions were chosen inserting gaps ran- 
domly in all the sequences but the larger one, such that 
considering these gaps at i = 0, all the sequences were of 
equal length. To define the distance between the letters 
of the alphabet we use the PAM matrix 'Tl'|. 

In figure^are shown the approach to equilibrium of the 
multiple sequence alignment of 3 sequences averaged over 
100 initial conditions for Uei = 10, 100, 1000 and 10000. 
Comparing with the result of the Carrillo-Lipman algo- 
rithm it is evident that for trei = 10000 the algorithm is 
very close to the global minimum. However, one should 




FIG. 4: Mean Energy versus time for the alignment of three 
sequences from the kinase family. From left to right: trei = 
10, 100, 1000 and 10000. The average were taken over 100 
initial conditions. The horizontal line represents the result of 
the Carrillo-Lipman algorithm 



In fact, one is often interested, rather than on the aver- 
age, in the minimum of all the alignments. In figure|Slwe 
represent an histogram in energy for the alignment, over 
1000 initial conditions for different values of trel^ of the 
same three proteins of the kinase family presented in fig- 
ure 0] Note again that for trd — 1000 we obtain exactly 
the Carillo-Lipman result with probability larger than 0. 
Moreover, looking to the structure of the histogram for 
trei = 1000, one can also conjecture that if the average 
multiple sequence alignment were calculated only with 
those alignments concentrated in the peak of lower ener- 
gies, the result presented in figure 0] will be closer to the 
one of the Carrillo-Lipman algorithm. 

Another symptom suggesting that the average over the 
realization must be taken with care comes from the anal- 
ysis of figure El There we present again histograms for 
trei — 1000 but using three different samples of the ki- 
nase family. Note that while sample 1 and sample 3 are 
very well behaved and the results compare very well with 
the Carillo-Lipman method, the situation for sample 2 is 
different. To get good results in this case, it is clearly 
necessary to go beyond 1000 montecarlo steps. 

In the same spirit of figure 0] figures [7| and |H1 reflect 
results suggesting that also for higher dimensions, if t^ei 
is large enough the algorithm should produce good align- 
ments. In fact, also in these cases the energy decreases 
linearly with t^ei ■ 

If the number of sequences is higher, the correla- 
tions between the sequences increases, and the algorithm 
should flnd better results. This fact may be clearly seen 
in flgure ini where the equilibration time of the algorithm 
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FIG. 5: Histograms of energy for the alignment of three se- 
quences from the kinase family at different trei- 



FIG. 7: Mean Energy versus time for the alignment of nine 
sequences from the kinase family. From left to right: trei = 
10, 100 and 1000. The average were taken over 100 initial 
conditions. 
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FIG. 6: Histograms of energy for the alignment of three dif- 
ferent samples of three sequences from the kinase family at 
trei = 1000. The Carrillo-Lipman results are represented by 
the arrows. 
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FIG. 8: Mean Energy versus time for the alignment of 18 
sequences from the kinase family. From left to right: trei ~ 
10, 100 and 1000. The average were taken over 100 initial 
conditions. 



(in m.c.s) is shown as a function of the number of se- 
quences. The equihbration time, measured as the time 

necessary to reduce the energy by a factor e, decreases 
linearly with the number of sequences to be ahgned. Of 
course the results may change if very different sequences 
are aligned, but in this case all other known algorithms 
fail to predict good alignments. Then, we may say that 
for the most common cases, of correlated sequences, we 
present an algorithm whose convergence time decreases 
with D, and whose moves, only increase quadratically 
with D. 

With these results at hand, we follow to study the 
structure of the space of the solutions as a function of 
the number of aligned sequences. Wc define a distance 
(d) between two alignments A, and B in the following 
way. Given two solutions, Aij and Bij (where the index 
i stands for the sequence and j for the position of the 
symbol in the sequence) we, aligned one by one of the 



D sequences of each solution using a Dynamic Program- 
ming algorithm reminiscent of the Needleman-Wunsh al- 
gorithm with the following score function: 

r if Aij = Bij 
Ca{A,B) = I 1 HAij^Bij 

I r > 1 if a gap is inserted 

and express d^^s as: 

hi 

In this way identical alignments will be at distance 
from each other, and the insertion of gaps to obtain good 
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FIG. 9; Equilibration time versus D average over 500 initial 
conditions for trei = 1000. 



FIG. 10: E ~ Eo versus d for the alignment of 3,4 and 5 
sequences of the kinase family 



alignments is penalized, such that the original alignments 
are altered minimally during the calculation of We 
calculated d with r ~ 2 and r = 8 and the results were 
the same (apart a constant shift in d) . Below we present 
figures for r = 2. 

We study how different are, from the solution of min- 
imum score, the other solutions obtained aligning D se- 
quences. For every value of D we use 1000 initial con- 
ditions and a trei = 1000. Note that the sequences used 
were always the same. This mean, we first aligned three 
sequences from the kinase family. Then, to align 4 se- 
quences we just add a new one to the previous three and 
started the alignment from scratch. The procedure was 
repeated for every new value of D. 

The results appear in figureslTUlandlTTlwhere E~Emm 
is plotted as a function of d. From the figures it becomes 
evident that the space of solution strongly depends on 
the number of sequences aligned. For example, while for 
three sequences the distance between the alignments is 
correlated with the difference in score, for more than 4 
sequences it is not true anymore. Moreover, while for 
D < 6 the distance between the solutions decreases, it 
increases for D > Q and remains constant for D > 12. 
Surely, these results reflect the correlation between the 
proteins aligned. For example, we may speculate that 
for 3 < Z? < 6, any new protein added contributed to 
find more similar alignments, i.e. added relevant infor- 
mation to the system. However, for D > 6 the sequences 
contributed with new and uncorrelated information that 
produce more distant alignments and for _D > 12 to add 
new proteins that belong to the kinase family will not 
change the relevant characteristics of the alignment. 

These results are relevant, considering the limitations 
of progressive algorithms to change the previous align- 
ment when new sequences are added. Figures ITUl and ITTI 

clearly suggest that the inclusion of one single sequence 
may dramatically change the character of the solutions. 
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FIG. 11; E-Eo versus d for the alignment of 6,9,10,11,12,15 
and 18 sequences of the kinase family 



V. CONCLUSIONS AND OUTLOOK 

In this work we presented a new probabilistic algo- 
rithm, to perform the Multiple Alignment of proteins. 
The algorithm is based on the mapping between the 
DPRM and the Multiple Sequence Alignment problem. 
At variance with other probabilistic algorithms our al- 
gorithm permits the variation of the number of gaps in 
the alignment without the necessity of expensive global 
moves. It is proved that for small number of sequences 
it reproduces the results of a complete algorithm. More- 
over, we show that for practical purposes the equilibra- 
tion time is almost independent on the number of se- 
quences aligned D and in the worst case, it scale linearly 
with D. Finally, we study the space of solutions for dif- 
ferent number of aligned sequences, and find a very rich 
structure that indicates the importance of just one se- 
quence in the multiple alignment. 

Of course the algorithm is still far from being compet- 



7 



itive with other approaches like HMM and CLUSTAL. 
We are already working in implementing a similar work 
but using Parallel Tempering instead of Simulated An- 
nealing. It is know that Parallel Tempering is more useful 
that S. A. when dealing with very hard problems, like spin 
glasses. Moreover, it is very suitable to parallelization. 
Also a direction of current work is the introduction of bi- 
ological information relevant for the alignment. This may 
impose important constraints in the possible alignments, 
that may in turn strongly reduce the space of possible 



solutions. And last, but not least, important program- 
ing optimization are necessary to make competitive this 
program from the computing time point of view. 
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